This example uses the tidyverse suite of packages.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
The code chunk below reads in the final project data.
df <- readr::read_csv("paint_project_train_data.csv", col_names = TRUE)
## Rows: 835 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Lightness, Saturation
## dbl (6): R, G, B, Hue, response, outcome
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df %>% glimpse()
## Rows: 835
## Columns: 8
## $ R <dbl> 172, 26, 172, 28, 170, 175, 90, 194, 171, 122, 0, 88, 144, …
## $ G <dbl> 58, 88, 94, 87, 66, 89, 78, 106, 68, 151, 121, 140, 82, 163…
## $ B <dbl> 62, 151, 58, 152, 58, 65, 136, 53, 107, 59, 88, 58, 132, 50…
## $ Lightness <chr> "dark", "dark", "dark", "dark", "dark", "dark", "dark", "da…
## $ Saturation <chr> "bright", "bright", "bright", "bright", "bright", "bright",…
## $ Hue <dbl> 4, 31, 8, 32, 5, 6, 34, 10, 1, 21, 24, 22, 36, 16, 26, 12, …
## $ response <dbl> 12, 10, 16, 10, 11, 16, 10, 19, 14, 25, 14, 19, 14, 38, 15,…
## $ outcome <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,…
The data consist of continuous and categorical inputs. The
glimpse() shown above reveals the data type for each
variable which state to you whether the input is continuous or
categorical. The RGB color model inputs, R, G,
and B are continuous (dbl) inputs. The HSL color model
inputs consist of 2 categorical inputs, Lightness and
Saturation, and a continuous input, Hue. Two
outputs are provided. The continuous output, response, and
the Binary output, outcome. However, the data type of the
Binary outcome is numeric because the Binary outcome is
encoded as outcome = 1 for the EVENT and
outcome = 0 for the NON-EVENT.
df %>% ggplot(aes(x = R)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
df %>% ggplot(aes(x = G)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
df %>% ggplot(aes(x = B)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Faceting against categorical variables does not give us notable results.
df %>% ggplot(aes(x = R)) + geom_histogram() + facet_wrap(~Lightness)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
df %>% ggplot(aes(x = R)) + geom_histogram() + facet_wrap(~ Saturation)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
df %>% ggplot(aes(x = R)) + geom_histogram(aes(fill = Saturation)) + facet_wrap(~ Lightness)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
df %>% ggplot(aes(x = Hue)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Again faceting against categorical variables does not give us notable results
df %>% ggplot(aes(x = Hue)) + geom_histogram() + facet_wrap(~ Saturation)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
df %>% ggplot(aes(x = Hue)) + geom_histogram() + facet_wrap(~ Lightness)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
df %>% ggplot(aes(x = Hue)) + geom_histogram(aes(fill = Saturation)) + facet_wrap(~ Lightness)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
df %>% ggplot(aes(y = Saturation)) + geom_bar()
df %>% ggplot(aes(y = Lightness)) + geom_bar()
lf <- df %>%
tibble::rowid_to_column() %>%
pivot_longer(c(R, G, B, Hue))
lf %>% ggplot(aes(x = name, y = value)) + geom_boxplot() + facet_wrap( ~ Lightness)
lf %>% ggplot(aes(x = name, y = value)) + geom_boxplot() + facet_wrap( ~ Saturation)
lf %>% ggplot(aes(x = name, y = value)) + geom_boxplot() + facet_wrap( ~ outcome)
df %>%
select(R, G, B, Hue) %>%
cor() %>%
corrplot::corrplot(type = 'upper',
method = 'color')
df %>%
tibble::rowid_to_column() %>%
pivot_longer(c(R, G, B)) %>%
ggplot(aes(Hue, value)) +
geom_point(aes(color = name), alpha = 0.3, size = 0.5) +
geom_smooth(aes(group = name), method = "glm", size = 0.5) +
scale_color_manual(values = list(
R = "red",
B = "blue",
G = "green"
)) +
facet_wrap(~ Lightness)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'
df %>%
tibble::rowid_to_column() %>%
pivot_longer(c(R, G, B)) %>%
ggplot(aes(Hue, value)) +
geom_point(aes(color = name), alpha = 0.3, size = 0.5) +
geom_smooth(aes(group = name), method = "glm", size = 0.5) +
scale_color_manual(values = list(
R = "red",
B = "blue",
G = "green"
)) +
facet_wrap(~ Saturation)
## `geom_smooth()` using formula = 'y ~ x'
From the plots above, while we didnt see much correlation between RGB and Hue, by plotting RGB against Hue, facetted by the categorical variables we see more clear (linear) trends. Further, Lightness seems to give us better trends than Saturation.
We can highlight this better when we choose values for saturation for the correlation plot.
create_corr_mat <- function(f) {
df %>%
filter(Lightness == f) %>%
select(R, G, B, Hue) %>% cor() %>%
as_tibble() %>%
tibble::rowid_to_column() %>%
pivot_longer(c(R,G,B,Hue)) %>%
mutate(value = round(value, digits = 3))
}
tibble::tibble(
Lcat = unique(df$Lightness),
cors_long = purrr::map(Lcat, create_corr_mat)
) %>%
unnest(cors_long) %>%
mutate(name = factor(name, levels = c('R','G','B','Hue'))) %>%
ggplot(aes(y = name, x = rowid, fill = value)) +
geom_tile() +
geom_text(aes(label = value), color = "white", size = 2) +
scale_fill_gradient2(high = "#053060",
mid = "#FFFFFF",
low = "#760C20") +
scale_x_reverse() +
facet_wrap(~ Lcat)
For the RGB color scheme, I plot the resposne against
value.
lf %>%
filter(name != 'Hue') %>%
ggplot(aes(x = value, y = response)) +
geom_point(aes(group = Lightness, color = Lightness), alpha = 0.2) +
geom_smooth(aes(group = Lightness, color = Lightness), size = 0.5, method = 'glm') +
facet_wrap(~ name)
## `geom_smooth()` using formula = 'y ~ x'
lf %>%
filter(name != 'Hue') %>%
ggplot(aes(x = value, y = response)) +
geom_point(aes(group = Saturation, color = Saturation), alpha = 0.2) +
geom_smooth(aes(group = Saturation, color = Saturation), size = 0.5) +
facet_wrap(~ name)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
There is an increasing trend in response with respect to the RGB
color scheme. When we facet the plot against Lightness, we
see that lightness values divide the response values into levels.
Whereas, when we facet against Saturation, we see the
saturation values indicate the deviations from the trend.
For HSL color scheme, I plot Hue against response. In the first plot,
I group by Lightness and facet by Saturation
and vice versa.
lf %>%
filter(name == 'Hue') %>%
ggplot(aes(x = value, y = response)) +
geom_point(aes(color = Lightness), alpha = 0.4) +
geom_smooth(aes(group = Lightness, color = Lightness), size = 0.5, alpha = 0.2) +
facet_wrap(~ Saturation)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
lf %>%
filter(name == 'Hue') %>%
ggplot(aes(x = value, y = response)) +
geom_point(aes(color = Saturation), alpha = 0.4) +
geom_smooth(aes(group = Saturation, color = Saturation), size = 0.5, alpha = 0.2) +
facet_wrap(~ Lightness)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
We see that each combination has levels based on the grouping
categories. From the two plots above, we can see that the
Saturation of pure and Lightness
of pale gives the best responses. Following is a plot with
the corresponding RGB values.
lf %>%
filter(name != 'Hue', Saturation == 'pure', Lightness == 'pale') %>%
ggplot(aes(x = value, y = response)) +
geom_point(alpha = 0.2) +
geom_smooth(size = 0.5, method = 'glm') +
facet_wrap(~ name)
## `geom_smooth()` using formula = 'y ~ x'
When we see the y-axis values, it is clear that the response is very high for the above mentioned constraints.
We apply logit transformation.
lf %>%
filter(name != 'Hue') %>%
mutate(y = boot::logit( (response - 0) / (100 - 0) ) ) %>%
ggplot(aes(x = value, y = y)) +
geom_point() +
facet_wrap(~ name)
lf %>%
filter(name == 'Hue') %>%
mutate(y = boot::logit( (response - 0) / (100 - 0) ) ) %>%
ggplot(aes(x = value, y = y)) +
geom_point(aes(color = Lightness), alpha = 0.4) +
geom_smooth(aes(group = Lightness, color = Lightness), size = 0.5, alpha = 0.2) +
facet_wrap(~ Lightness)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
lf %>%
filter(name == 'Hue') %>%
mutate(y = boot::logit( (response - 0) / (100 - 0) ) ) %>%
ggplot(aes(x = value, y = y)) +
geom_point(aes(color = Lightness), alpha = 0.4) +
geom_smooth(aes(group = Lightness, color = Lightness), size = 0.5, alpha = 0.2) +
facet_wrap(~ Saturation)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
However, we don’t observe noticeable impact on the plots.
I first use HSL colors to see if there is a relation between
outcome and Hue.
lf %>%
filter(name == 'Hue') %>%
ggplot(mapping = aes(x = value, y = outcome)) +
geom_jitter(height = 0.02, width = 0) +
geom_smooth(formula = y ~ x + I(x^2), method = 'glm', method.args = list(family = 'binomial'))
The first plot reveals that mostly the outcome is 0. I
try to facet the plot according to the categorical varaibles
Saturation and Lightness.
lf %>%
filter(name == 'Hue') %>%
ggplot(mapping = aes(x = value, y = outcome)) +
geom_jitter(mapping = aes(color = Saturation), height = 0.02, width = 0) +
geom_smooth(aes(group = Saturation, color = Saturation), formula = y ~ x + I(x^2), method = 'glm', method.args = list(family = 'binomial')) +
facet_wrap(~ Saturation)
We see some trends, especially that outcome is 1 when
Saturation is "gray".
lf %>%
filter(name == 'Hue') %>%
ggplot(mapping = aes(x = value, y = outcome)) +
geom_jitter(mapping = aes(color = Lightness), height = 0.02, width = 0) +
geom_smooth(aes(group = Lightness, color = Lightness), formula = y ~ x + I(x^2), method = 'glm', method.args = list(family = 'binomial')) +
facet_wrap(~ Lightness)
Here we some more relations but nothing concrete.
Next, I will use the RGB color scheme to visualize
outcome.
lf %>%
filter(name != 'Hue') %>%
ggplot(aes(x = value, y = outcome)) +
geom_jitter(aes(color = name), height = 0.02, width = 0) +
geom_smooth(aes(group = name, color = name), formula = y ~ x + I(x^2), method = 'glm', method.args = list(family = 'binomial')) +
scale_color_manual(values = list(
R = "red",
B = "blue",
G = "green"
)) +
facet_wrap(~ name)
We see some variation in the trends so lets checkout effects of the categorical inputs.
lf %>%
filter(name != 'Hue') %>%
ggplot(aes(x = value, y = outcome)) +
geom_jitter(aes(color = Saturation), height = 0.02, width = 0, alpha = 0.2) +
geom_smooth(aes(group = Saturation, color = Saturation), formula = y ~ x + I(x^2), method = 'glm', method.args = list(family = 'binomial'), alpha = 0.1) +
facet_wrap(~ name)
lf %>%
filter(name != 'Hue') %>%
ggplot(aes(x = value, y = outcome)) +
geom_jitter(aes(color = Lightness), height = 0.02, width = 0, alpha = 0.2) +
geom_smooth(aes(group = Lightness, color = Lightness), formula = y ~ x + I(x^2), method = 'glm', method.args = list(family = 'binomial'), alpha = 0.1) +
facet_wrap(~ name)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
We see that Saturation divides the outcome into levels
from lowest probability of outcome to highest probability of outcome,
whereas Lightness creates multimodal Gaussian like
distributions.
Finally, I visualize just the effects of the categorical variables on
the outcome.
df %>%
ggplot(aes(y = outcome)) +
geom_bar() +
facet_grid(Lightness ~ Saturation)
Finally, I wanted to see if i can visualize the colors we were working with.
df %>%
mutate(rgb_val = rgb(R,G,B,maxColorValue = 255) ) %>%
mutate(x_val = rgb_val %>% substr(2,7) %>% strtoi(base = 16)) %>%
mutate(y = boot::logit( (response - 0) / (100 - 0) ) ) %>%
ggplot(aes(x = log(x_val), y = y)) +
geom_point(aes(color = rgb_val)) +
guides(color = "none")
df %>%
mutate(rgb_val = rgb(R,G,B,maxColorValue = 255) ) %>%
mutate(x_val = rgb_val %>% substr(2,7) %>% strtoi(base = 16)) %>%
ggplot(aes(x = log(x_val), y = outcome)) +
geom_jitter(aes(color = rgb_val), height = 0.02, width = 0) +
geom_smooth(formula = y ~ x + I(x*2), method = 'glm', method.args = list(family = 'binomial')) +
guides(color = "none")
Now that we have completed the data visualization, I will transform the (continuous) data variables to remove skew and multiple modes.
I will first test logit transformations of all the variables. This is
because we know the domain of the variables. R,
G, B \(\in\)
[0,255] while Hue \(\in\)
[0,360].
R does really well and we can see more central
tendencies. On faceting with Lightness, we get more
Gaussian looking trends.
df %>% mutate(logit_R = boot::logit(R/255)) %>% ggplot(aes(x = logit_R)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 9 rows containing non-finite values (`stat_bin()`).
df %>% mutate(logit_R = boot::logit(R/255)) %>% ggplot(aes(x = logit_R)) + geom_histogram() + facet_wrap(~ Lightness)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 9 rows containing non-finite values (`stat_bin()`).
G does not do so well immediately. However, it shows
better trends with faceting against Lightness.
df %>% mutate(logit_G = boot::logit(G/255)) %>% ggplot(aes(x = logit_G)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
df %>% mutate(logit_G = boot::logit(G/255)) %>% ggplot(aes(x = logit_G)) + geom_histogram() + facet_wrap(~ Lightness)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
B shows similar trends.
df %>% mutate(logit_B = boot::logit(B/255)) %>% ggplot(aes(x = logit_B)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
df %>% mutate(logit_G = boot::logit(G/255)) %>% ggplot(aes(x = logit_G)) + geom_histogram() + facet_wrap(~ Lightness)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Hue becomes skewed after the logit transform. Faceting
against the individual categories does not make a big difference
df %>% mutate(logit_Hue = boot::logit(Hue/360)) %>% ggplot(aes(x = logit_Hue)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
df %>% mutate(logit_Hue = boot::logit(Hue/360)) %>% ggplot(aes(x = logit_Hue)) + geom_histogram() + facet_wrap(~ Lightness)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
df %>% mutate(logit_Hue = boot::logit(Hue/360)) %>% ggplot(aes(x = logit_Hue)) + geom_histogram() + facet_wrap(~ Saturation)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Faceting and coloring too does not reveal much trend.
I am satisfied with the results of logit. I am going to
further transform the Hue using Box-Cox transform.
df %>%
mutate(Hue_logit = -1*boot::logit(Hue/360)) %>%
mutate(Hue_logit_BC = car::boxCoxVariable(Hue_logit)) %>%
ggplot(aes(x = Hue_logit_BC)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
df %>%
mutate(R_BC = car::boxCoxVariable(R)) %>%
ggplot(aes(x = R_BC)) +
geom_histogram(bins = 10)
## Warning: Removed 835 rows containing non-finite values (`stat_bin()`).
That didnt work :(
Finally I will visualize the logit output response with logit transformed inputs.
df %>%
mutate(y = boot::logit(response/100)) %>%
mutate(R_logit = boot::logit(R/255)) %>%
mutate(G_logit = boot::logit(G/255)) %>%
mutate(B_logit = boot::logit(B/255)) %>%
mutate(Hue_logit = boot::logit(Hue/360)) %>%
select(y, R_logit, G_logit, B_logit, Hue_logit) %>%
tibble::rowid_to_column() %>%
pivot_longer(c(R_logit, G_logit, B_logit, Hue_logit)) %>%
ggplot(mapping = aes(x = value, y = y)) +
geom_point() +
facet_wrap(~ name)
This is really good because we see extreme linearity against
R, G, and B inputs.
Understandably Hue alone would not be sufficient to predict
the response.